Goal: Load Twitter API data into a data management system and write queries to retrieve and analyze it.
| user_id | screen_name | followers_count | friends_count | account_created_at | verified |
|---|---|---|---|---|---|
| x44196397 | elonmusk | 95589999 | 114 | 2009-06-02 | TRUE |
Apache Spark is an open-source engine for large-scale
parallel data processing known for its speed, ease of use, and
cutting-edge analytics. It provides high-level APIs in general-purpose
programming languages such as Scala, Python, and R, as well as an
optimized engine that supports general computation graphs for data analysis.
Azure Databricks is an analytics platform based on
Microsoft Azure cloud services, providing the latest versions of Apache
Spark and allowing consistent integration with open source libraries.
Built with Spark capabilities, Databricks provides a cloud platform with
an interactive workspace and fully managed Spark clusters, allowing
users to work in a single, easy-to-use environment, create and configure
clusters in seconds, and quickly execute Spark code.
Twitter (Elon Musk 2015-2022):
Dataset of Elon Musk’s most recent tweets during 2015-2022, stored in
RDS format, with each tweet in its own row object. All tweets were
collected, parsed, and plotted using rtweet in R. In total, the dataset
contains several thousand tweets, including retweets and replies. All
objects go into a single database.
| # | Field | # | Field | # | Field | # | Field |
|---|---|---|---|---|---|---|---|
| 1 | status_id | 14 | hashtags | 27 | quoted_followers_count | 40 | retweet_location |
| 2 | created_at | 15 | symbols | 28 | quoted_location | 41 | retweet_description |
| 3 | user_id | 16 | media_expanded_url | 29 | quoted_description | 42 | retweet_verified |
| 4 | screen_name | 17 | media_type | 30 | quoted_verified | 43 | name |
| 5 | text | 18 | mentions_screen_name | 31 | retweet_status_id | 44 | location |
| 6 | source | 19 | quoted_status_id | 32 | retweet_text | 45 | description |
| 7 | reply_to_screen_name | 20 | quoted_text | 33 | retweet_created_at | 46 | followers_count |
| 8 | is_quote | 21 | quoted_created_at | 34 | retweet_source | 47 | friends_count |
| 9 | is_retweet | 22 | quoted_source | 35 | retweet_favorite_count | 48 | account_created_at |
| 10 | favorite_count | 23 | quoted_favorite_count | 36 | retweet_retweet_count | 49 | verified |
| 11 | retweet_count | 24 | quoted_retweet_count | 37 | retweet_user_id | | |
| 12 | quote_count | 25 | quoted_user_id | 38 | retweet_screen_name | | |
| 13 | reply_count | 26 | quoted_screen_name | 39 | retweet_followers_count | | |
Once you have set up your Twitter app and loaded the
rtweet package in R, you can run the following commands to
generate a token for the Twitter API. The first step in your code is
authentication: create a token that authenticates your access to
tweets.
library(rtweet) # load rtweet package
twitter_token <- create_token(app = "mytwitterapp",
consumer_key = "api_key", consumer_secret = "api_secret_key",
access_token = "access_token", access_secret = "access_token_secret")
use_oauth_token(twitter_token) # use this token for subsequent requests
Now we search Twitter’s full-archive API. We run the
search_fullarchive command to access historical tweets from a
user. The example below captures all of Elon Musk’s tweets from
January 1, 2010 to May 28, 2022.
df <- search_fullarchive(q = "from:elonmusk", n = 10000,
                         env_name = environment_name,
                         fromDate = "201001010000", toDate = "202205280000")
| created_at | screen_name | text | favorite_count | retweet_count | quote_count | reply_count | is_quote | is_retweet |
|---|---|---|---|---|---|---|---|---|
| 2022-05-27 22:52:53 | elonmusk | Starship 24 rolls out to the pad at Starbase https://t.co/PGh6FY6x8w | 0 | 0 | 0 | 0 | FALSE | TRUE |
| 2022-05-27 22:16:31 | elonmusk | @westcoastbill It was 96% in favor of me, until I commented. Ironically, there might be more left-wing bots following me than right-wing haha. | 23603 | 1177 | 83 | 1466 | FALSE | FALSE |
| 2022-05-27 22:03:30 | elonmusk | @IheartTesla @Tesla V true | 1869 | 92 | 7 | 152 | FALSE | FALSE |
| 2022-05-27 20:59:20 | elonmusk | @EvaFoxU Yes | 3014 | 142 | 9 | 236 | FALSE | FALSE |
The queries to answer:

- (path finding) Display the thread (replies) of tweets (the tweet, time, id, in-reply-to id, user name with their screen name) posted by Elon Musk with screen_name in the order in which they were posted.
- (location) From which location have the tweets been posted most actively (largest number of tweets)?
- (hashtags) Which hashtags does Musk use the most, and how many tweets are associated with these hashtags?
- (topics) Which word does Musk mention the most in his tweets? Which company products does Musk mention the most? Products include Falcon 9, Starlink satellites, Model 3 cars, etc.
- (trending) Are there any trends in what Musk tweets about the company?
- (nature of engagement) What is the percentage of each type of tweet (simple tweet, reply, retweet, quoted tweet) relative to the overall number of tweets?

Building a Databricks workspace using an Apache Spark cluster.
First, we create a Databricks workspace from the Azure portal and
then launch the workspace, which redirects us to the interactive
Databricks portal. We create a Spark cluster from the Databricks
interactive workspace and configure a notebook on the cluster. In the
notebook, we can use either PySpark or SparkR
to read data from a dataset into a Spark DataFrame. Using the Spark
DataFrame, we can run a Spark SQL job to query the data.
Azure Databricks stores the uploaded data file in the FileStore/ folder of DBFS. We then use the following PySpark or SparkR command to read the data into a DataFrame.
Python Code:
df = spark.read.csv(path = 'dbfs:/FileStore/dfclean.csv',
header = "true", multiLine = "true")
# register table so it is accessible via SQL Context
df.createOrReplaceTempView('twitterTemp')
R Code:
require(SparkR)
df <- read.df("dbfs:/FileStore/dfclean.csv", source = "csv",
header="true", inferSchema = "true")
createOrReplaceTempView(df, "twitterTemp") # register table
FileStore is a special folder in DBFS where users can import and store data files from their local computer through the DBFS web interface. The Spark API then reads the imported data into an Apache Spark DataFrame, as shown above.
Users can perform relational procedures on DataFrames using a
domain-specific language (DSL) similar to R dataframes and Python
Pandas. DataFrames support standard relational operators, including
projection (select), filter (where),
join, and aggregations (group by).
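To make these operators concrete outside Spark, here is a minimal plain-Python sketch of projection, filter, and group-by over a list of row dictionaries; the sample rows are invented for illustration.

```python
from collections import defaultdict

rows = [
    {"screen_name": "elonmusk", "is_retweet": False, "favorite_count": 100},
    {"screen_name": "elonmusk", "is_retweet": True,  "favorite_count": 0},
    {"screen_name": "elonmusk", "is_retweet": False, "favorite_count": 50},
]

# projection (select): keep only some columns
projected = [{"favorite_count": r["favorite_count"]} for r in rows]

# filter (where): keep rows matching a predicate
originals = [r for r in rows if not r["is_retweet"]]

# aggregation (group by): count rows per key
counts = defaultdict(int)
for r in rows:
    counts[r["is_retweet"]] += 1

print(len(originals))   # number of non-retweets
print(dict(counts))     # rows per is_retweet value
```

A Spark DataFrame expresses the same operations declaratively (`select`, `filter`, `groupBy`) and executes them in parallel across the cluster.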
Python Code:
mentionDF = spark.sql("SELECT created_at, text, \
mentions_user_id, mentions_screen_name \
FROM twitterTemp")
mentionDF = mentionDF.filter(mentionDF.mentions_user_id != "NA")  # keep rows with mentions
mentionDF.createOrReplaceTempView('mentionDF')
R Code:
require(SparkR)
mentionDF <- select(df, "created_at", "text",
"mentions_user_id", "mentions_screen_name")
mentionDF <- filter(mentionDF, mentionDF$mentions_user_id != "NA")
createOrReplaceTempView(mentionDF, "mentionDF")
| created_at | text | mentions_user_id | mentions_screen_name |
|---|---|---|---|
| 2011-12-21 11:13:59 | Yeah, this really is me, as my Mom @mayemusk will attest. Not sure I can handle just doing 140 char missives. Will put longer thoughts on G+ | x14171401 | mayemusk |
| 2011-12-27 19:54:15 | Vending Machine Attendant Admits B3 Selection Has Changed A Lot Over The Years http://t.co/nccSGzCQ #OnionInnovation | x14075928 | TheOnion |
| 2011-12-28 22:29:48 | Any Idiot Could Have Come Up With The Car http://t.co/e9cLgfEg #OnionInnovation | x14075928 | TheOnion |
| 2011-12-28 22:27:08 | @TheOnion So true :) | x14075928 | TheOnion |
SQL Query:
SELECT mentions_screen_name,
COUNT(*) AS n
FROM mentionDF
WHERE mentions_screen_name != 'NA'
GROUP BY mentions_screen_name
ORDER BY n DESC;
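The same count-and-rank aggregation can be sketched in plain Python with collections.Counter; the sample mention values below are invented for illustration.

```python
from collections import Counter

mentions = ["TheOnion", "TheOnion", "mayemusk", "NA", "TheOnion", "SpaceX"]

# drop the 'NA' placeholder, then count and rank screen names
counts = Counter(m for m in mentions if m != "NA")
ranked = counts.most_common()  # most-mentioned first
print(ranked)
```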
When creating a managed table, Spark will manage both
the table data and the metadata. To obtain the percentage of different
types of tweets, we begin by creating a managed table for each tweet
type (simple, reply, retweet, quote). The following R command returns
Elon Musk’s different tweet types and overall frequency.
R Code:
retweets <- nrow(df[df$is_retweet == "TRUE", ])
quotes <- nrow(df[df$is_quote == "TRUE", ])
normal <- nrow(df[df$is_quote == "FALSE" & df$is_retweet == "FALSE", ])
replies <- nrow(df[df$reply_to_status_id != "NA", ])
data <- data.frame(type = c("retweet", "quote", "regular", "reply"),
n = c(retweets, quotes, normal, replies))
typeDF <- createDataFrame(data)
createOrReplaceTempView(typeDF, "typeDF")
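To illustrate the classification and percentage computation independently of Spark, here is a small Python sketch over invented tweet flags.

```python
from collections import Counter

tweets = [
    {"is_retweet": False, "is_quote": False, "reply_to_status_id": None},
    {"is_retweet": True,  "is_quote": False, "reply_to_status_id": None},
    {"is_retweet": False, "is_quote": True,  "reply_to_status_id": None},
    {"is_retweet": False, "is_quote": False, "reply_to_status_id": "123"},
]

def tweet_type(t):
    # classification order: retweet, quote, reply, then plain tweet
    if t["is_retweet"]:
        return "retweet"
    if t["is_quote"]:
        return "quote"
    if t["reply_to_status_id"] is not None:
        return "reply"
    return "regular"

counts = Counter(tweet_type(t) for t in tweets)
percent = {k: 100 * v / len(tweets) for k, v in counts.items()}
print(percent)
```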
So the different tweet types are retweet,
quoted, reply_to, and normal.
Now, let’s create a data table to hold the information for each of Elon Musk’s retweets and query it to obtain the number of times Musk retweeted each user. With these results, we can visualize the Twitter users Musk retweets most frequently.
CREATE TABLE retweets (
    retweet_status_id STRING,
    retweet_text STRING,
    retweet_created_at TIMESTAMP,
    retweet_source STRING,
    retweet_favorite_count INT,
    retweet_retweet_count INT,
    retweet_user_id STRING,
    retweet_screen_name STRING,
    retweet_name STRING,
    retweet_followers_count INT,
    retweet_friends_count INT,
    retweet_statuses_count INT,
    retweet_location STRING,
    retweet_description STRING,
    retweet_verified BOOLEAN
);

-- Load data into the table
LOAD DATA INPATH '/user/dfclean' INTO TABLE retweets;
Retweets Table:
| created_at | text | screen_name | favorite_count | retweet_count | followers_count | location | description | verified |
|---|---|---|---|---|---|---|---|---|
| 2022-04-27 23:41:36 | Docking confirmed! https://t.co/RyPZBAv5Lo | SpaceX | 128109 | 6890 | 22548863 | Hawthorne, CA | SpaceX designs, manufactures and launches the world’s most advanced rockets and spacecraft | TRUE |
| 2020-05-14 23:22:56 | Breakthrough, Part Deux @LVCVA https://t.co/JQ7sjPXSfZ | boringcompany | 7278 | 649 | 849005 | Austin/Las Vegas | Beat the snail | TRUE |
| 2016-12-05 06:33:41 | We’re releasing Universe, a platform for measuring and training AI agents: https://t.co/bx7OjMDaJK | OpenAI | 3088 | 1759 | 748401 | NA | OpenAI’s mission is to ensure that artificial general intelligence benefits all of humanity. We’re hiring: http://openai.com/jobs | TRUE |
| 2016-03-31 20:10:58 | Longest line EVER #Model3 https://t.co/EafMhK3BDC | Tesla | 5487 | 3059 | 15719350 | NA | electric cars, giant batteries and solar | TRUE |
The following SparkR code parses the created_at datetime
column to extract the year, month, and weekday for each tweet.
timeDF <- SparkR::select(df, c("status_id", "created_at"))
timeDF$created_at <- to_timestamp(timeDF$created_at)
timeDF$year <- year(timeDF$created_at)
timeDF$month <- date_format(to_date(timeDF$created_at), "MMMM")
timeDF$weekday <- date_format(to_date(timeDF$created_at), "EEEE")
timeDF$hour <- hour(timeDF$created_at)
# label tweets before noon as "morning", the rest as "night"
timeDF$time <- ifelse(timeDF$hour < 12, "morning", "night")
createOrReplaceTempView(timeDF, "timeDF")
The resulting schema for the above DataFrame is string, timestamp, integer, string, and string for the status_id, created_at, year, month, and weekday columns, respectively.
| created_at | year | month | weekday | time |
|---|---|---|---|---|
| 2022-04-01 06:48:20 | 2022 | April | Friday | morning |
| 2021-10-28 05:13:27 | 2021 | October | Thursday | morning |
| 2021-01-13 21:20:05 | 2021 | January | Wednesday | night |
| 2020-04-21 22:12:20 | 2020 | April | Tuesday | night |
| 2020-01-22 05:49:57 | 2020 | January | Wednesday | morning |
| 2019-04-14 20:33:46 | 2019 | April | Sunday | night |
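The same breakdown, including the time-of-day label shown in the table, can be sketched with Python's standard datetime module (the timestamp is taken from the first table row):

```python
from datetime import datetime

created_at = datetime.strptime("2022-04-01 06:48:20", "%Y-%m-%d %H:%M:%S")

year = created_at.year
month = created_at.strftime("%B")    # full month name, e.g. "April"
weekday = created_at.strftime("%A")  # full weekday name, e.g. "Friday"
time_of_day = "morning" if created_at.hour < 12 else "night"

print(year, month, weekday, time_of_day)
```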
SQL Query:
SELECT weekday AS created_weekday, COUNT(*) AS n
FROM timeDF
GROUP BY created_weekday
ORDER BY n DESC;
Python Code:
import pyspark.sql.functions as f
textDF = df.select('text', 'status_id', 'created_at')
# Split the text on spaces, explode to one word per row,
# then count and rank word frequencies
wordCounts = textDF.withColumn('word', f.explode(f.split(f.col('text'), ' '))) \
    .groupBy('word') \
    .count() \
    .sort('count', ascending=False)
wordCounts.show()
R Code:
library(dplyr)
library(stringr)
library(tidytext)

# combine the removal patterns into one regular expression
reg <- paste(c("&|<|>", "[^\x01-\x7F]",
               "\\s?(f|ht)(tp)(s?)(://)([^\\.]*)[\\.|/](\\S*)"),
             collapse = "|")
# note: unnest_tokens() runs on a local data frame, not a Spark DataFrame
df %>% select(text, status_id) %>%
  mutate(text = str_remove_all(text, reg)) %>%
  unnest_tokens(word, text, token = "tweets") %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]"),
         !str_detect(word, "^#@\\S+")) %>%
  count(word, sort = TRUE)
Here we use the syuzhet R package to iterate over a
vector of strings consisting of the text of all of Elon Musk’s tweets
in our dataset. To obtain this vector, the plain_tweets()
function from the rtweet package cleans the tweet text into
plain text. We then pass the vector to the get_sentiment() function,
which returns sentiment values based on a sentiment dictionary
developed from a collection of human-coded sentences.
R Code:
round_time <- function(x, secs)
as.POSIXct(hms::round_hms(x, secs))
sent_scores <- function(x)
syuzhet::get_sentiment(plain_tweets(x)) - .5
df.sentiment <- df %>%
dplyr::mutate(days = round_time(created_at, 60 * 60 * 24),
sentiment = sent_scores(text)) %>%
dplyr::group_by(days) %>%
dplyr::summarise(sentiment = sum(sentiment, na.rm = TRUE))
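The round-to-day-and-sum step can be sketched in plain Python; the timestamps and sentiment scores below are invented stand-ins for get_sentiment() output.

```python
from collections import defaultdict
from datetime import date, datetime

# (timestamp, sentiment score) pairs; scores are made up for illustration
scored = [
    (datetime(2022, 5, 27, 22, 52), 0.4),
    (datetime(2022, 5, 27, 9, 10), -0.2),
    (datetime(2022, 5, 26, 11, 0), 0.6),
]

# round each timestamp down to its day, then sum the scores per day
daily = defaultdict(float)
for ts, score in scored:
    daily[ts.date()] += score

print(sorted(daily.items()))
```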
Using the same package, we now use the
get_nrc_sentiment() function to get the emotions and
valences from the NRC sentiment dictionary for each tweet. First, we
must clean the text column of our dataset.
This syuzhet function calculates the presence of each emotion and its
valence per tweet, with the text cleaned so that each row holds one
sentence from our dataset.
The output has one column per emotion type plus a positive and a
negative valence. The ten columns are: “anger”, “anticipation”,
“disgust”, “fear”, “joy”, “sadness”, “surprise”, “trust”, “negative”,
and “positive”.
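Conceptually, get_nrc_sentiment() looks each word up in the NRC lexicon and tallies category hits per row. A toy Python sketch with a tiny invented lexicon (not the real NRC dictionary):

```python
# toy stand-in for the NRC lexicon: word -> emotion/valence categories
lexicon = {
    "love": {"joy", "positive", "trust"},
    "hate": {"anger", "disgust", "negative"},
}

categories = ["anger", "anticipation", "disgust", "fear", "joy",
              "sadness", "surprise", "trust", "negative", "positive"]

def nrc_scores(text):
    # tally one hit per category for every lexicon word in the text
    counts = dict.fromkeys(categories, 0)
    for word in text.lower().split():
        for cat in lexicon.get(word, ()):
            counts[cat] += 1
    return counts

scores = nrc_scores("love love hate")
print(scores)
```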
R Code:
cleanTweet = as.vector(df$text)
cleanTweet = gsub("rt|RT", "", cleanTweet)        # remove retweet markers
cleanTweet = gsub("http\\w+", "", cleanTweet)     # remove links
cleanTweet = gsub("<.*?>", "", cleanTweet)        # remove HTML tags
cleanTweet = gsub("@\\w+", "", cleanTweet)        # remove @mentions
cleanTweet = gsub("[[:punct:]]", "", cleanTweet)  # remove punctuation
cleanTweet = gsub("\r?\n|\r", " ", cleanTweet)    # replace newlines with spaces
cleanTweet = gsub("[[:digit:]]", "", cleanTweet)  # remove digits
cleanTweet = gsub("[ |\t]{2,}", " ", cleanTweet)  # collapse runs of whitespace
cleanTweet = gsub("^ ", "", cleanTweet)           # trim leading space
cleanTweet = gsub(" $", "", cleanTweet)           # trim trailing space
textSentiment <- get_nrc_sentiment(cleanTweet)
nrc_sentiment <- cbind(df, textSentiment) %>%
dplyr::select(created_at, anger, anticipation, disgust, fear,
joy, sadness, surprise, trust, negative, positive)